#install.packages("tidyverse")
#install.packages("palmerpenguins")
#install.packages("patchwork")
#install.packages("ggridges")
#install.packages("gghighlight")
#install.packages("MetBrewer")
#install.packages("ggthemes")
#library(tidyverse)
#library(palmerpenguins)
#library(patchwork)
#library(ggridges)
#library(gghighlight)
#library(MetBrewer)
#library(ggthemes)
Week 3: Data Visualization
{ggplot2}
Greetings!
Eunji Kong
4th year SPED doctoral student
Finished EDS specialization
EDS project
Learning Objectives
- Understand the basic syntax requirements for {ggplot2}
- Recognize various options for displaying data
- Become familiar with various {ggplot2} options/layers
- Basically, how to graph and visualize data
Lecture/Material Structure
PDF Lecture Notes
- Include hyperlinks that take you directly to the relevant topics
- Hyperlinks: everything underlined
.qmd file (recommended)
Same information as the PDF but allows you to write notes directly in the file
You can also test out code interactively as you follow along
You can then render this document as an HTML file for later review
Visual mode
{tidyverse}
{tidyverse} is a meta-package that loads a set of core packages
# If you don't have the package installed
# install.packages("tidyverse")
# load library
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.2
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
{ggplot2}
- gg stands for “grammar of graphics”
- Resources
- ggplot2 book
- email Dr. Nese for digital copy
- Posit cheat sheet
- can be helpful, perhaps more so after a little experience
- R Graphics Cookbook
- R Graph Gallery
- past students have really liked this one
Components
Every ggplot has three components:
- data
- the data used to produce the plot
- aesthetic mappings (aes)
- between variables and visual properties
- layer(s)
- usually through the geom_*() function plus various other layers
Template
I use base R's native pipe |> instead of the {magrittr} pipe %>%, but for the code in this lecture they are essentially the same thing.
data |> #pipe here
ggplot(aes(mapping)) + #plus here
geom_function() +
additional layers
The code above is equivalent to the code below.
ggplot(data, aes(mapping)) +
geom_function() +
additional layers
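As a quick check that the two pipes behave the same for this kind of code, the sketch below runs one summary both ways and compares the results. It assumes {dplyr} (loaded by {tidyverse}) and {palmerpenguins} are installed; the object names native_result and magrittr_result are made up for illustration.

```r
library(dplyr)           # provides %>% (re-exported from {magrittr}) and count()
library(palmerpenguins)  # provides the penguins data

# Same operation, written once with each pipe
native_result   <- penguins |> count(species)
magrittr_result <- penguins %>% count(species)

# TRUE means both pipes produced exactly the same table
identical(native_result, magrittr_result)
```

The native pipe needs R 4.1 or later, which is one reason you may still see %>% in older code.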
data
# install.packages("palmerpenguins")
library(palmerpenguins)
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
# str(penguins)
# glimpse(penguins)
# colnames(penguins)
# View(penguins)
ggplot(aes(mapping))
aesthetic mappings describe how variables in the data are mapped to visual properties
Some visual properties include:
x
y
color (will come back to it)
fill (will come back to it)
alpha (will come back to it)
others (linetype, shape, linewidth, size, group)
penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g))
QUESTION: What do you see? Why is there nothing plotted?
ANSWER:
Layers
geom_function()
Use a geom_function() to represent data points
| | Only 1 Variable | Continuous Variable 2 | Discrete Variable 2 |
|---|---|---|---|
| Continuous Variable 1 | geom_histogram, geom_density | geom_point, geom_smooth, geom_line | geom_density_ridges (from {ggridges}), geom_boxplot, geom_violin, geom_col |
| Discrete Variable 1 | geom_bar | x | geom_count |
Other
Heatmap: geom_tile
geom_histogram()
General research question: How do the values of my continuous variable vary across its range?
Our data specific question: What does the distribution of penguin bill lengths (mm) look like in this sample? Are there any outliers? Is it unimodal?
penguins |>
  ggplot(aes(x = bill_length_mm)) + # Remember to use + instead of |> or %>%
  geom_histogram()
Color vs Fill
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(color = "blue")
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(color = "blue",
                 fill = "green")
Color = outline
Fill = area
Transparency
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(color = "blue",
                 fill = "green",
                 alpha = 0.2)
Color, fill & alpha in this example are all fixed settings (i.e., they apply to all data points).
More aes mapping
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = sex), # note that fill is inside aes()
                 alpha = 0.7)
Fill here is a conditional mapping, meaning that the fill color is different based on the variable (in this case the sex of the birds).
Fixed vs Conditional
penguins |>
ggplot(aes(x = bill_length_mm)) +
geom_histogram(fill = "green")
In the above example, where fill is not within aes(), fill is a fixed setting. Also notice that the color name is in quotes.
penguins |>
ggplot(aes(x = bill_length_mm)) +
geom_histogram(aes(fill = sex))
In the above example, aes() is used to access variables and make changes according to a specific variable. Here, fill is conditional on the variable sex. Also notice that variable names are not in quotes.
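One way to see the difference between a fixed setting and a conditional mapping is to look at the data ggplot2 actually draws from. The sketch below is only an illustration: it uses layer_data() from {ggplot2}, sets bins = 30 explicitly to match the default, and the object names (fixed_plot, mapped_plot, fixed_fills, mapped_fills) are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Fixed setting: fill is outside aes()
fixed_plot <- ggplot(penguins, aes(x = bill_length_mm)) +
  geom_histogram(fill = "green", bins = 30)

# Conditional mapping: fill is inside aes() and tied to a variable
mapped_plot <- ggplot(penguins, aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = sex), bins = 30)

# layer_data() returns the data frame a layer is drawn from
fixed_fills  <- unique(layer_data(fixed_plot)$fill)
mapped_fills <- unique(layer_data(mapped_plot)$fill)

fixed_fills   # a single value: the color we typed
mapped_fills  # several values: the colors ggplot2 picked for the groups in sex
```

You may see a warning about removed rows; that comes from the missing bill lengths in the data.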
a <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(fill = "green")
b <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = sex))
#install.packages("patchwork")
library(patchwork)
a + b
Be mindful of aes()
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(fill = "green")
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = "green"))
Question: What is wrong with the bottom code? What do you think the plot will look like?
Answer:
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = "green"))
geom_density()
General research question: How does the probability density of my continuous variable vary across its range?
Think of it as a smoothed histogram
- Difference: it does not use bins, and the y-axis shows density (relative frequency per unit of x) instead of counts
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density()
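To make the "smoothed histogram" idea concrete, the sketch below draws a histogram on the density scale and overlays the density curve. This is an illustration, not part of the original lecture code: it assumes after_stat(density) is the rescaling you want, sets bins = 30 explicitly, and the names density_scale_plot and bars are made up.

```r
library(ggplot2)
library(palmerpenguins)

density_scale_plot <- ggplot(penguins, aes(x = bill_length_mm)) +
  # after_stat(density) rescales the bar heights so the total bar area is 1
  geom_histogram(aes(y = after_stat(density)), bins = 30, alpha = 0.4) +
  geom_density()

density_scale_plot

# Check: bar height times bar width, summed over all bars, should be 1
bars <- layer_data(density_scale_plot, 1)
sum(bars$density * (bars$xmax - bars$xmin))
```

Because both layers now share the same y-axis units, the curve and the bars can be compared directly.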
More aes mapping
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex))
Add transparency for clarity
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5)
Histogram vs Density
a <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = sex), alpha = 0.5)
b <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5)
a + b
Question: What differences do you see? When would you use one versus the other?
Answer:
facet_wrap
wrap by 1 variable
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5) +
  facet_wrap(~sex) # remember to use ~
wrap by 2 variables
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5) +
  facet_wrap(year~sex)
wrap using vars()
penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5) +
  facet_wrap(vars(year, sex))
geom_density_ridges()
geom_density_ridges: two variables
# install.packages("ggridges")
library(ggridges)
penguins |>
  ggplot(aes(bill_length_mm, sex)) +
  geom_density_ridges()
geom_point()
General research question: How are two numeric variables related? (raw observations)
Our data specific question: What is the relationship between penguins' bill length and body mass?
penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point()
Add color
penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(color = "magenta")
Emphasize specific data points (island = Torgersen)
penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(color = "magenta") +
  geom_point(data = filter(penguins, island == "Torgersen"), color = "blue")
penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(data = filter(penguins, island == "Torgersen"), color = "blue") +
  geom_point(color = "magenta")
Question: What happened when we switched the order of the geom_points?
Answer:
Emphasize another way
# install.packages("gghighlight")
penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(color = "magenta") +
  gghighlight::gghighlight(island == "Torgersen")
geom_smooth()
General research question: What is the pattern of relationship of two continuous variables? (trend)
Our data specific question: What is the trend or pattern in the relationship between penguins' bill length and body mass?
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Method
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm")
No need to include "x =" or "y =" because aes() treats its first argument as x and its second as y.
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm", level = .65)
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm", se = FALSE)
Note: This is not the same as geom_line(). We are fitting a line of best fit with geom_smooth()
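To see what "line of best fit" means here, the sketch below fits the same straight line directly with base R's lm() and compares it with what geom_smooth(method = "lm") draws. Treat it as an illustration under the assumption that both use the default formula y ~ x; the names smooth_plot, fit, drawn, and ours are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

smooth_plot <- ggplot(penguins, aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm")

# The same straight line, fitted directly with base R
fit <- lm(body_mass_g ~ bill_length_mm, data = penguins)
coef(fit)  # intercept and slope of the line of best fit

# Compare the points ggplot2 draws with predictions from our own model
drawn <- layer_data(smooth_plot)
ours  <- predict(fit, newdata = data.frame(bill_length_mm = drawn$x))
all.equal(drawn$y, unname(ours))
```

geom_line(), by contrast, would simply connect the raw observations in order of x without fitting any model.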
Adding Layers
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")
Global
If we map an aesthetic such as color = species inside the first aes() (the one in ggplot()), it carries through to all additional layers.
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) + # color = species
  geom_point() +
  geom_smooth(method = "lm")
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", aes(color = species))
Local
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + # color = species
  geom_smooth(method = "lm")
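The practical effect of global versus local mappings shows up in how many trend lines geom_smooth() fits. The sketch below counts them from the layer data; it is an illustration only, and the names global_plot, local_plot, n_lines_global, and n_lines_local are made up.

```r
library(ggplot2)
library(palmerpenguins)

# Global: color = species is inherited by every layer
global_plot <- ggplot(penguins, aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm")

# Local: color = species applies to the points only
local_plot <- ggplot(penguins, aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm")

# Layer 2 is geom_smooth(); count how many separate lines it fits
n_lines_global <- length(unique(layer_data(global_plot, 2)$group))
n_lines_local  <- length(unique(layer_data(local_plot, 2)$group))

c(global = n_lines_global, local = n_lines_local)
```

With the global mapping there is one line per species; with the local mapping the smooth layer sees no grouping and fits a single line through all penguins.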
geom_line()
geom_point: raw observations, not linked
geom_smooth: trend/pattern
geom_line: raw data linked
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_line()
When should you use line plots?
Usually when time is involved
One data point per time point for each line or group
Shows linkage
# Original data
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
# Create new data set so that there is only one data point per year
penguins_year <- penguins |>
  group_by(year) |>
  summarize(avg_bill = mean(bill_length_mm, na.rm = TRUE))
head(penguins_year)
# A tibble: 3 × 2
year avg_bill
<int> <dbl>
1 2007 43.7
2 2008 43.5
3 2009 44.5
penguins_year |>
  ggplot(aes(year, avg_bill)) +
  geom_line()
# Create new data set so that there is one data point for each year for each island
penguins_year_species <- penguins |>
  group_by(year, island) |>
  summarize(avg_bill = mean(bill_length_mm, na.rm = TRUE))
head(penguins_year_species)
# A tibble: 6 × 3
# Groups: year [2]
year island avg_bill
<int> <fct> <dbl>
1 2007 Biscoe 45.0
2 2007 Dream 44.5
3 2007 Torgersen 38.8
4 2008 Biscoe 44.6
5 2008 Dream 43.8
6 2008 Torgersen 38.8
penguins_year_species |>
  ggplot(aes(year, avg_bill, group = island, color = island)) +
  geom_line()
geom_boxplot()
General research question: How is a continuous variable distributed across groups, and how do the medians, quartiles, and potential outliers compare?
penguins |>
  ggplot(aes(species, body_mass_g)) +
  geom_boxplot()
geom_violin()
General research question: How is the full distribution of a continuous variable shaped across groups?
penguins |>
  ggplot(aes(species, body_mass_g)) +
  geom_violin()
geom_bar()
geom_bar() vs geom_col()
penguins |>
  ggplot(aes(species)) + # one variable in the `aes()`
  geom_bar()
geom_col()
summarized_penguins <- penguins |>
  group_by(species) |>
  summarize(N = n())
head(summarized_penguins)
# A tibble: 3 × 2
species N
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
summarized_penguins |>
  ggplot(aes(species, N)) +
  geom_col()
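The two examples above produce the same picture by different routes: geom_bar() counts the rows for you, while geom_col() plots values you have already computed. The sketch below checks that the bar heights agree. It is an illustration only; count(species, name = "N") is used as a shortcut for the group_by() and summarize() steps, and the names bar_plot, col_plot, bar_heights, and col_heights are made up.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# geom_bar(): give it the raw rows and it counts them
bar_plot <- ggplot(penguins, aes(species)) +
  geom_bar()

# geom_col(): count first, then plot the counts as given
col_plot <- penguins |>
  count(species, name = "N") |>
  ggplot(aes(species, N)) +
  geom_col()

# Both layers end up with the same bar heights
bar_heights <- layer_data(bar_plot)$y
col_heights <- layer_data(col_plot)$y
all.equal(as.numeric(bar_heights), as.numeric(col_heights))
```

A rule of thumb that follows: use geom_bar() when your data has one row per observation, and geom_col() when it has one row per bar.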
More aes mapping
summarized_penguins2 <- penguins |>
  group_by(species, sex) |>
  na.omit() |>
  summarize(bill_length_avg = mean(bill_length_mm))
summarized_penguins2
# A tibble: 6 × 3
# Groups: species [3]
species sex bill_length_avg
<fct> <fct> <dbl>
1 Adelie female 37.3
2 Adelie male 40.4
3 Chinstrap female 46.6
4 Chinstrap male 51.1
5 Gentoo female 45.6
6 Gentoo male 49.5
ggplot(summarized_penguins2, aes(species, bill_length_avg)) +
geom_col(aes(fill = sex))
Position
ggplot(summarized_penguins2, aes(species, bill_length_avg)) +
geom_col(aes(fill = sex), position = "dodge")
coord_flip
ggplot(summarized_penguins2, aes(species, bill_length_avg)) +
geom_col(aes(fill = sex), position = "dodge") +
coord_flip()
geom_count()
General research question: How many observations fall in each category pair?
Our data specific question: How many of each species live on each island?
penguins |>
  ggplot(aes(species, island)) +
  geom_count()
Scales
penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n))) +
  scale_color_gradient(low = "lightblue", high = "brown")
What do scales do?
Scales control how the mappings you added to aes are displayed (e.g., color range, size range, breaks and labels, range or limits)
Template: scale_*
The * is the aesthetic being scaled. Most aes mappings have scales: x, y, size, color, fill, linetype, alpha, etc.
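As an illustration of the scale_* template, the sketch below adjusts two scales on one plot: the x scale (limits and breaks) and the color scale (one hand-picked color per species). The limits, breaks, and color names are arbitrary choices for this example, and scaled_plot is a made-up name.

```r
library(ggplot2)
library(palmerpenguins)

scaled_plot <- ggplot(penguins, aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point() +
  # x scale: set the axis range and where the tick marks go
  scale_x_continuous(limits = c(30, 60), breaks = seq(30, 60, by = 10)) +
  # color scale: choose the color for each species by name
  scale_color_manual(values = c(Adelie = "darkorange",
                                Chinstrap = "purple",
                                Gentoo = "cyan4"))

scaled_plot
```

Note that the mapping itself (color = species) stays in aes(); the scale only changes how that mapping is displayed.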
Colorblind friendly
penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n))) +
  scale_color_viridis_c()
penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n))) +
  scale_color_viridis_c(option = "turbo") # options: magma, inferno, plasma, viridis, cividis, rocket, mako, turbo (or letters A-H)
- {MetBrewer} - inspired by art in the Met
#install.packages("MetBrewer")
penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n))) +
  scale_color_gradientn(colors = MetBrewer::met.brewer("Isfahan1"))
geom_tile()
General research question: What's the value of a numerical measure (Z) for each (X, Y) pair? In our example, what is the correlation (Z) between each pair of variables (X, Y)?
corr <- penguins |>
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  drop_na() |>
  cor()
pc <- corr |>
  as.data.frame() |>
  rownames_to_column(var = "row") |>
  pivot_longer(
    cols = -row,
    names_to = "col",
    values_to = "cor")
head(pc)
# A tibble: 6 × 3
row col cor
<chr> <chr> <dbl>
1 bill_length_mm bill_length_mm 1
2 bill_length_mm bill_depth_mm -0.235
3 bill_length_mm flipper_length_mm 0.656
4 bill_length_mm body_mass_g 0.595
5 bill_depth_mm bill_length_mm -0.235
6 bill_depth_mm bill_depth_mm 1
ggplot(pc, aes(row, col, fill = cor)) +
geom_tile()
ggplot(pc, aes(row, col, fill = cor)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(pc, aes(row, col, fill = cor)) +
geom_tile() +
scale_fill_viridis_c()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
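Since the heatmap answers "what is Z for each (X, Y) pair", it can help to print the value on each tile. The sketch below is one possible extension, not part of the original lecture code: it rebuilds pc so the block runs on its own, adds geom_text() labels, and swaps in a diverging scale_fill_gradient2() centered at 0. The color choices and the name heat_plot are assumptions made for this example.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)
library(palmerpenguins)

# Rebuild the long-format correlation table
pc <- penguins |>
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  drop_na() |>
  cor() |>
  as.data.frame() |>
  rownames_to_column(var = "row") |>
  pivot_longer(cols = -row, names_to = "col", values_to = "cor")

heat_plot <- ggplot(pc, aes(row, col, fill = cor)) +
  geom_tile() +
  # print each correlation, rounded to two decimals, on top of its tile
  geom_text(aes(label = round(cor, 2))) +
  # diverging scale: -1 and 1 at the extremes, white at 0
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

heat_plot
```

A diverging scale suits correlations because the sign matters; for colorblind-friendly output you could keep scale_fill_viridis_c() instead.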
Other Layers
labels
axis labels
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species))
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  labs(x = "Bill length (mm)",
       y = "Body mass (g)")
title
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  labs(x = "Bill length (mm)",
       y = "Body mass (g)",
       title = "Relationship between bill length and body mass")
subtitle
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  labs(x = "Bill length (mm)",
       y = "Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle = "Grouped by species")
caption
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  labs(x = "Bill length (mm)",
       y = "Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle = "Grouped by species",
       caption = "palmerpenguins")
tag
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  labs(x = "Bill length (mm)",
       y = "Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle = "Grouped by species",
       caption = "palmerpenguins",
       tag = "(A)")
legend (one way)
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  labs(x = "Bill length (mm)",
       y = "Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle = "Grouped by species",
       caption = "palmerpenguins",
       tag = "(A)",
       color = "SPECIES!")
theme
The default is theme_gray(). There are many built-in alternatives in {ggplot2}. My go-to is theme_minimal() because it is clean, without a lot of unnecessary visual elements.
If you want to set a theme globally (that is, for all the graphs in your document), add theme_set(theme_minimal()) on the first line after you load your libraries.
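A minimal sketch of the global approach is shown here. It relies on theme_set() returning the previously active theme, which lets you switch back later; the names old_theme and themed_plot are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

# theme_set() returns the previous theme, so we can restore it later
old_theme <- theme_set(theme_minimal())

# No theme_*() call needed here: the plot picks up theme_minimal()
themed_plot <- ggplot(penguins, aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species))
themed_plot

# Optional: go back to the theme that was active before
theme_set(old_theme)
```

Adding a theme_*() layer to an individual plot still overrides the global setting for that plot only.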
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  labs(x = "Bill length (mm)",
       y = "Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle = "Grouped by species",
       caption = "palmerpenguins",
       tag = "(A)",
       color = "SPECIES!") +
  theme_minimal()
Other packages:
#install.packages("ggthemes")
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  ggthemes::theme_economist() +
  ggthemes::scale_color_economist()
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  labs(x = "Bill length (mm)",
       y = "Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle = "Grouped by species",
       caption = "palmerpenguins",
       tag = "(A)",
       color = "SPECIES!") +
  theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 11, family = "Georgia"),
        axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
        panel.background = element_rect(fill = "grey95"),
        plot.background = element_rect(fill = "white"),
        panel.grid.major = element_line(color = "black"),
        panel.grid.minor = element_blank(),
        legend.position = "top",
        legend.title = element_text(face = "bold"),
        legend.background = element_rect(fill = "transparent"))
Practice together
1. Get to know the data: str(mpg) or head(mpg)
2. What is the overall distribution of city fuel efficiency (mpg) across car models?
3. How does the distribution vary by drivetrain type (e.g., front-, rear-, 4-wheel drive)?
4. What is the relationship between city and highway mpg?
5. Can we focus on/emphasize the relationship for Audi?
6. Can we have larger points for clarity?
7. How do the city/hwy mpg relationships differ by car class?
8. Too much clutter. Can we just see trends?
9. Still too much clutter. Is there a better way to see each trend clearly?
10. Can we make it colorblind friendly?
11. Can we clarify axis and legend labels?
12. Can we polish the appearance with a theme?